Language Models, Smoothing, and IDF Weighting
Abstract
In this paper, we investigate the relationship between smoothing in language models and idf weights. Language models consider the relative within-document frequency and the relative collection frequency of a term; idf weights are very similar to the latter, but yield higher weights for rare terms. Examining the correlation between the language model parameters and relevance for two test collections, we find that the idf type of weighting seems to be more appropriate. Based on the observed correlation, we devise empirical smoothing as a new type of term weighting for language models, and retrieval experiments confirm the general applicability of our method. Finally, we show that the most appropriate form of describing the relationship between the language model parameters and relevance seems to be a product form, which confirms a previously proposed language model.
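The quantities named in the abstract can be illustrated with a minimal sketch. The toy collection, the function names, and the mixing weight `lam` are assumptions made for illustration only. The smoothing shown is standard linear interpolation (Jelinek-Mercer), a common textbook choice, not the empirical smoothing the paper proposes.

```python
import math
from collections import Counter

# Toy collection (hypothetical data, not from the paper).
docs = [
    "language models for retrieval".split(),
    "smoothing language models".split(),
    "idf weighting for rare terms".split(),
]

def within_doc_freq(term, doc):
    """Relative within-document frequency: tf(t, d) / |d|."""
    return Counter(doc)[term] / len(doc)

def collection_freq(term, docs):
    """Relative collection frequency: cf(t) / total number of tokens."""
    total = sum(len(d) for d in docs)
    return sum(Counter(d)[term] for d in docs) / total

def idf(term, docs):
    """Plain idf, log(N / df(t)); assumes the term occurs in at least one document."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def jm_smoothed(term, doc, docs, lam=0.5):
    """Linear interpolation of the document and collection estimates.

    lam is an assumed mixing weight, not a value taken from the paper.
    """
    return lam * within_doc_freq(term, doc) + (1 - lam) * collection_freq(term, docs)

for term in ("language", "rare"):
    print(term, collection_freq(term, docs), idf(term, docs),
          jm_smoothed(term, docs[0], docs))
```

In this toy collection the rarer term has the lower relative collection frequency and the higher idf, which is the inverse relationship the abstract alludes to. The smoothed estimate also stays nonzero for a term that is absent from the document but present in the collection.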
Similar resources
Cumulative Progress in Language Models for Information Retrieval
The improvements to ad-hoc IR systems over the last decades have been recently criticized as illusionary and based on incorrect baseline comparisons. In this paper several improvements to the LM approach to IR are combined and evaluated: Pitman-Yor Process smoothing, TF-IDF feature weighting and model-based feedback. The increases in ranking quality are significant and cumulative over the standa...
Axiomatic Analysis of Smoothing Methods in Language Models for Pseudo-relevance Feedback by Hussein Hazimeh Thesis
Pseudo-Relevance Feedback (PRF) is an important general technique for improving retrieval effectiveness without requiring any user effort. Several state-of-the-art PRF models are based on the language modeling approach where a query language model is learned based on feedback documents. In all these models, feedback documents are represented with unigram language models smoothed with a collecti...
Part of Speech Based Term Weighting for Information Retrieval
Automatic language processing tools typically assign to terms so-called ‘weights’ corresponding to the contribution of terms to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informati...
A novel term weighting scheme based on discrimination power obtained from past retrieval results
Term weighting for document ranking and retrieval has been an important research topic in information retrieval for decades. We propose a novel term weighting method based on a hypothesis that a term’s role in accumulated retrieval sessions in the past affects its general importance regardless. It utilizes availability of past retrieval results consisting of the queries that contain a particula...
Recovering Trace Links for SysML Models Using VSM-based Information Retrieval
Automated traceability recovery utilizing information retrieval techniques has been recognized as important for effective software development. In this paper, we discuss two approaches for augmenting the vector space model (VSM). The first approach employs document identifiers of a term, indicating where the term has been found, and a context-sensitive retrieval strategy that uses these identifi...
Publication date: 2010